Abstract

Background: Enumeration of chemical graphs satisfying given constraints is one of the fundamental problems in 
chemoinformatics and bioinformatics since it leads to a variety of useful applications including structure 
determination of novel chemical compounds and drug design. 

Results: In this paper, we consider the problem of enumerating all tree-like chemical graphs from a given set of 
feature vectors, which is specified by a pair of upper and lower feature vectors, where a feature vector represents 
the frequency of prescribed paths in a chemical compound to be constructed. This problem can be solved by 
applying the algorithm proposed by Ishida et al. to each single feature vector in the given set, but this method
may take much computation time because in general there are many feature vectors in a given set. We propose a 
new exact branch-and-bound algorithm for the problem so that all the feature vectors in a given set are handled 
directly. Since we cannot use the bounding operation proposed by Ishida et al. due to upper and lower
constraints, we introduce new bounding operations based on upper and lower feature vectors, a bond constraint, 
and a detachment condition. 

Conclusions: Our proposed algorithm is useful for enumerating tree-like chemical graphs with given upper and 
lower bounds on path frequencies. 



Introduction 

Development of novel drugs is one of the major goals in
chemoinformatics and bioinformatics. To achieve this purpose,
it is important not only to investigate common chemical
properties over chemical compounds having common structural
patterns [1-3] but also to study methods of enumerating
chemical structures satisfying given constraints. The
enumeration of chemical structures has a long history; Cayley
[4] already considered the enumeration of structural isomers
of alkanes in the 19th century. Applications of the
enumeration of chemical compounds include structure
determination using mass spectra and/or NMR spectra [5,6],
virtual exploration of the chemical universe [7,8],
reconstruction of molecular structures from their signatures
[9,10], and classification of chemical compounds [11].

In the field of machine learning, the pre-image problem 
[12,13] has been studied. In this problem, a desired object 
is computed as a feature vector in a feature space, and 
then the feature vector is mapped back to the input space, 
where this mapped back object is called a pre-image. The 
definition of the feature vectors based on the frequency of 
labeled paths [14,15] or small fragments [11,16] has been 
widely used. Akutsu and Fukagawa [17] formulated the 
graph pre-image problem as the problem of inferring 
graphs from the frequency of paths of labeled vertices, 
which corresponds to the pre-image problem, and proved 
that the problem is NP-hard even for planar graphs with 
bounded degrees [17]. Nagamochi [18] proved that a
graph determined by the frequency of paths of length 1 can
be found in polynomial time, if one exists.

To enumerate tree-like chemical graphs, Fujiwara
et al. [19] proposed a branch-and-bound algorithm
which consists of a branching procedure based on the
tree enumeration algorithm due to Nakano and Uno
[20,21] and bounding operations derived from the path
frequency and the atom-atom bonds. In addition, to
reduce the size of search trees, Ishida et al. [22]
introduced a new bounding operation, called the
detachment-cut, based on the result by Nagamochi [18].
Implementations of the algorithm proposed by Ishida
et al. [22] are available on a web server
(http://sunflower.kuicr.kyoto-u.ac.jp/tools/enumol/) for
enumerating tree-like chemical graphs with given path
frequency. However, an instance whose constraint is
specified by a single feature vector admits no solution in
many cases. Therefore, a constraint more relaxed than a
single feature vector is needed to obtain solutions in the
tree-like chemical graph enumeration problem.

In this paper, we are given a set of feature vectors, which
is specified by a pair of upper and lower feature vectors,
and enumerate all tree-like chemical graphs satisfying one
of the vectors. It seems that this can be done by simply
applying the algorithm proposed by Ishida et al. to each
single feature vector in the given set. However, this
method will take much computation time because in general
there are many feature vectors in a given set. We propose
a new exact branch-and-bound algorithm for the problem so
that all the feature vectors in a given set are handled
directly.

Methods 

Preliminaries and problem formulation 

A graph is called a multigraph if multiple edges (i.e.,
edges with the same end vertices) are allowed; otherwise
it is called simple. A path $P$ is a sequence
$v_0, e_1, v_1, e_2, v_2, \ldots, e_k, v_k$ of distinct vertices $v_i$
($i = 0, \ldots, k$) and edges $e_j$ that join $v_{j-1}$ and $v_j$
($j = 1, \ldots, k$). Without confusion we may write
$P = (v_0, v_1, \ldots, v_k)$. The length $|P|$ of path $P$ is
defined to be $k$, i.e., the number of edges. Assume that a
set $\Sigma = \{\ell_1, \ell_2, \ldots, \ell_s\}$ of labels (i.e., chemical
elements) is given. Let each label $\ell$ be associated with a
valence $val(\ell) \in \mathbb{Z}_+$. A multigraph $G$ is called
$\Sigma$-labeled if each vertex $v$ has a label $\ell(v) \in \Sigma$, and is
called $(\Sigma, val)$-labeled if, in addition, the degree of each
vertex $v$ is $val(\ell(v))$, i.e., the valence of the element
$\ell(v)$. We regard chemical compounds as $(\Sigma, val)$-labeled,
self-loopless, and connected multigraphs, where vertices and
labels represent atoms and elements, respectively. For a path
$P = (v_0, v_1, \ldots, v_k)$, we call
$\ell(P) = (\ell(v_0), \ell(v_1), \ldots, \ell(v_k))$ the label sequence of $P$.
Given a label sequence $t$, let $\#t$ denote the number of paths
$P$ with $\ell(P) = t$ in a graph, where multiple edges with the
same end-vertices are treated as a single edge and paths are
considered to be "directed." The feature vector $f_K(G)$ of
level $K$ ($\in \mathbb{Z}_+$) of $G$ is defined to be the vector whose
entry $f_K(G)[t]$ ($|t| \le K$) represents $\#t$. See Fig. 1 for an
example.

Figure 1 A chemical compound and its feature vector. An
illustration of a $(\Sigma, val)$-labeled multitree $G$ and its feature
vector $f_1(G)$. Notice that multiple edges with the same
end-vertices are treated as one edge, where #OC = #CO = 2.
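The definition of the feature vector can be illustrated with a small sketch. The function name `feature_vector` and the input encoding (a dictionary of vertex labels and a list of edges) are assumptions made for this illustration and are not part of the original method description; the code simply counts directed label sequences of paths with at most K edges after collapsing multiple edges.

```python
from collections import Counter


def feature_vector(labels, edges, K):
    """Count directed label sequences of all paths with at most K edges.

    labels: dict mapping vertex -> label (hypothetical encoding)
    edges:  list of (u, v) pairs; repeated pairs model multiple bonds
    """
    adj = {v: set() for v in labels}
    for u, v in edges:
        # Using sets collapses multiple edges into a single edge.
        adj[u].add(v)
        adj[v].add(u)

    counts = Counter()

    def extend(path):
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 < K:
            for w in adj[path[-1]]:
                if w not in path:  # paths consist of distinct vertices
                    extend(path + [w])

    for v in labels:
        extend([v])
    return counts


# Carbon dioxide O=C=O, with each double bond given as two parallel edges.
co2_labels = {0: "O", 1: "C", 2: "O"}
co2_edges = [(0, 1), (0, 1), (1, 2), (1, 2)]
f1 = feature_vector(co2_labels, co2_edges, 1)
```

Because paths are treated as directed, every path of positive length is counted once from each end, which is why the entries for a label sequence and its reverse always agree.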

Let $\deg(v; G)$ denote the degree of a vertex $v$ in a
graph $G$. The tree-like chemical graph enumeration problem
with a single given feature vector can be formulated as
follows [19].

Enumeration of Tree-like chemical graphs with given Path
Frequency (ETPF)

Given a set $\Sigma$ of labels, a valence function
$val : \Sigma \to \mathbb{Z}_+$, and a feature vector $g$ of level $K$, find
all $(\Sigma, val)$-labeled multitrees $T$ such that $f_K(T) = g$ and
$\deg(v; T) = val(\ell(v))$ for all vertices $v \in V(T)$.

Observe that a large number of chemical compounds 
contain a high proportion of hydrogens. Based on this 
fact, another model can be considered in the problem
ETPF by removing all hydrogen atoms. These two different
models were proposed by Fujiwara et al. [19] and
Ishida [23].

In this paper, we consider the problem of enumerating
all tree-like chemical graphs based on given upper and
lower feature vectors because we want to relax the feature
vector constraint in the problem ETPF. For feature
vectors $g_1$ and $g_2$ of level $K$, we define $g_1 \le g_2$ to mean
$g_1[t] \le g_2[t]$ for any label sequence $t$ ($|t| \le K$). The
problem of enumerating tree-like compounds from two given
feature vectors can be formulated based on the problem
ETPF as follows (see Fig. 2 for an illustration).

Enumeration of Tree-like chemical graphs with given Upper
and Lower bounds on path Frequencies (ETULF)

Given a set $\Sigma$ of labels, a valence function
$val : \Sigma \to \mathbb{Z}_+$, and feature vectors $g_U$ and $g_L$ of level
$K$ ($g_L \le g_U$), find all $(\Sigma, val)$-labeled multitrees $T$ such
that $g_L \le f_K(T) \le g_U$ and $\deg(v; T) = val(\ell(v))$ for all
vertices $v \in V(T)$.

For the problem ETULF, we assume that $g_L(\ell) = g_U(\ell)$
for each atom type $\ell \in \Sigma$, where $g(L)$ denotes the entry in
$g$ that corresponds to a label sequence $L$ (thus $g(\ell)$
specifies the number of vertices of label $\ell$), and that
$g_L(L) \le g_U(L)$ for any label sequence $L$ ($|L| \ge 2$).

Note that the number $n$ of vertices is given by
$\sum_{\ell \in \Sigma} g(\ell)$. To solve the problem ETULF, we start with
an empty graph, and repeatedly extend the current tree $T$ by
appending a new vertex with each label $\ell \in \Sigma$ to obtain a
valid tree (a tree that does not violate any constraints on
output trees) one by one until we get $n$ vertices. In order
to avoid duplicate outputs, we follow the branch-and-bound
framework of Fujiwara et al. [19], which first
defines a canonical representation for isomorphic trees,
and then lists them using the algorithm of Nakano and
Uno [20,21] (the branching operation), discarding invalid
trees with some bounding operations. Since we cannot
directly use the bounding operation proposed by Ishida
et al. [22] due to upper and lower constraints, we
introduce some new bounding operations.

Figure 2 An instance of ETULF. An instance of ETULF with upper and lower feature vectors, which admits two different solutions.

Canonical representation of trees and the branching 
operation 

In this section, we explain a canonical representation of 
trees introduced by Fujiwara et al. [19] and the branching 
operation based on the canonical representation. 

First of all, we introduce a root of a tree based on the 
following theorem. 

Theorem 1 (Jordan [24]) For any tree with $n$ vertices,
either there exists a unique vertex $v^*$ such that each subtree
obtained by removing $v^*$ contains at most $\lfloor (n-1)/2 \rfloor$
vertices, or there exists a unique edge $e^*$ such that both of
the subtrees obtained by removing $e^*$ contain exactly $n/2$
vertices.

Such a vertex $v^*$ and an edge $e^*$ in Theorem 1 are
called the unicentroid and the bicentroid, respectively. Either
of them is called the centroid. Note that a bicentroid exists
only for an even $n$. Since the case of a bicentroid is similar
to the case of a unicentroid, we explain only the case of a
unicentroid below.
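As an illustration of Theorem 1, the following sketch finds the centroid of a tree by computing, for each vertex, the size of the largest component left after its removal. The function name `centroid` and the edge-list encoding with vertices numbered 0 to n-1 are assumptions of this example; it returns a one-element tuple for a unicentroid and the two end vertices of the bicentroid edge otherwise.

```python
def centroid(n, edges):
    """Return (v,) for a unicentroid or (u, v) for a bicentroid edge."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Breadth-first traversal from vertex 0 to fix parents and an order.
    parent = {0: None}
    order = [0]
    for v in order:
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                order.append(w)

    # Subtree sizes, accumulated from the last visited vertex upwards.
    size = {v: 1 for v in adj}
    for v in reversed(order[1:]):
        size[parent[v]] += size[v]

    def heaviest(v):
        # Components after removing v: child subtrees and the rest.
        parts = [size[w] for w in adj[v] if parent.get(w) == v]
        parts.append(n - size[v])
        return max(parts)

    best = min(heaviest(v) for v in adj)
    return tuple(v for v in adj if heaviest(v) == best)


path4 = centroid(4, [(0, 1), (1, 2), (2, 3)])          # even n
path5 = centroid(5, [(0, 1), (1, 2), (2, 3), (3, 4)])  # odd n
```

For the path on four vertices the two middle vertices are returned, corresponding to the bicentroid edge; for the path on five vertices the single middle vertex is the unicentroid.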

Next we introduce a canonical representation of trees
that must be unique up to isomorphism. Let $T$ be a tree
of $n$ vertices rooted at a vertex $v_0$ (which is not
necessarily its unicentroid). Suppose that it is embedded in
the plane as an ordered tree, where $v_0$ is located at the top.
Without loss of generality, let $v_0, v_1, \ldots, v_{n-1}$ be
indexed by the depth-first search (DFS) that starts from
$v_0$ and visits vertices from left to right. Define
the depth $d(v)$ of a vertex $v$ to be the length of the
(unique) path from $v_0$ to $v$ in $T$. The depth-label
sequence $L(T)$ of $T$ is defined to be

\[ L(T) = (d(v_0), \ell(v_0), d(v_1), \ell(v_1), \ldots, d(v_{n-1}), \ell(v_{n-1})). \]

Given an arbitrary order of labels, we define the order
of depth-label sequences as follows. For any $T_1$ and $T_2$,
we write $L(T_1) > L(T_2)$ if $L(T_1)$ is lexicographically
larger than $L(T_2)$. Then the canonical representation of a
rooted tree is defined by the largest depth-label
sequence among all its plane embeddings. This
is equivalent to the left-heavy plane embedding [20,21].
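The canonical representation can be sketched recursively: the largest depth-label sequence is obtained by ordering the child subtrees of every vertex in descending order of their own canonical sequences. The function name `canonical_sequence`, the `children` dictionary, and the numeric `rank` that fixes an order of labels are assumptions of this example; it is a direct illustration of the definition, not the constant-time procedure of Nakano and Uno.

```python
def canonical_sequence(children, labels, rank, v=0, depth=0):
    """Largest depth-label sequence of the subtree rooted at v.

    children: dict vertex -> list of child vertices (any order)
    labels:   dict vertex -> label
    rank:     dict label -> integer fixing an order of labels
    The sequence is returned as a list of (depth, label rank) pairs.
    """
    subsequences = [
        canonical_sequence(children, labels, rank, c, depth + 1)
        for c in children.get(v, [])
    ]
    # Heavier (lexicographically larger) subtrees are placed to the left.
    subsequences.sort(reverse=True)
    sequence = [(depth, rank[labels[v]])]
    for s in subsequences:
        sequence.extend(s)
    return sequence


rank = {"H": 0, "C": 1}
labels = {0: "C", 1: "H", 2: "C", 3: "H"}
# Two plane embeddings of the same rooted tree.
embedding_a = {0: [1, 2], 2: [3]}
embedding_b = {0: [2, 1], 2: [3]}
seq_a = canonical_sequence(embedding_a, labels, rank)
seq_b = canonical_sequence(embedding_b, labels, rank)
```

Both embeddings yield the same sequence, which is what makes it usable as a representative of the isomorphism class of the rooted tree.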

Thus our branching task is to list all centroid-rooted
left-heavy trees with $n$ vertices and $m$ ($= |\Sigma|$) labels.
Following the scheme of [20,21], we define a parent-child
relation between two left-heavy trees. The parent $P(T)$
of a left-heavy tree $T$ is obtained from $T$ by removing
its rightmost leaf. Clearly $P(T)$ is still left-heavy. In this
way, we can define a family tree $\mathcal{T}(n, m)$ of left-heavy
trees whose leaves are exactly what we want to obtain.

Therefore we only need to enumerate the (leaf) nodes
of $\mathcal{T}(n, m)$. This can be done by starting from the
empty tree (the root node of $\mathcal{T}(n, m)$) and repeatedly
appending a new leaf to some appropriate place on the
rightmost path of the current tree. Our branching
operation employs the algorithm of Nakano and Uno
[20,21], which extends the current tree $T$ (i.e., finds a
child of $T$) in constant time [19].
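The parent-child relation can be sketched on depth-label sequences stored as lists of (depth, label) pairs. The names `parent` and `candidate_children` are assumptions of this example. Note that the sketch only generates every way of appending a leaf to the rightmost path; it does not perform the left-heavy test or the constant-time update of the Nakano and Uno algorithm, so some candidates would still have to be rejected.

```python
def parent(sequence):
    """Remove the rightmost leaf, i.e. the last vertex in DFS order."""
    return sequence[:-1]


def candidate_children(sequence, labels):
    """All trees obtained by appending one leaf to the rightmost path.

    A leaf attached to the rightmost-path vertex at depth d - 1 appears
    as a new final pair (d, label) in the depth-label sequence.
    """
    if not sequence:
        return [[(0, label)] for label in labels]
    last_depth = sequence[-1][0]
    return [
        sequence + [(d, label)]
        for d in range(1, last_depth + 2)
        for label in labels
    ]


tree = [(0, "C"), (1, "C"), (2, "H")]
kids = candidate_children(tree, ["C", "H"])
```

Every candidate has the original tree as its parent, which is the property that lets the family tree be traversed without storing visited trees.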

Bounding operations 

In this section, we explain how to check the validity of
the current tree $T$. If we can conclude that $T$ and all its
descendants are not valid, then we can discard $T$. Our
bounding operation discards $T$ if at least one of the
following criteria is violated:

(C1) The root of $T$ remains the centroid of an output
(the centroid constraint);

(C2) $\deg(v; T) \le val(\ell(v))$ for all $v \in V(T)$ (the valence
constraint);

(C3) $f_K(T) \le g_U$, and $g_L \le f_K(T)$ if $|T| = n$ (the
feature vector constraint);

(C4) $T$ can be extended to a connected and loopless
tree with $n$ vertices (the detachment constraint);

(C5) $T$ can have a descendant which has an appropriate
number of multiple bonds (the multiplicity constraint).

(C1) and (C2) are the same as in the work by Fujiwara
et al. [19] and are not difficult to check. (C3) and (C4) are
different from the work by Fujiwara et al. [19] and
Ishida et al. [22] due to upper and lower constraints.
(C5) is a new bounding operation that we propose in
this paper. In the following three subsections, we
discuss the three bounding operations resulting from (C3),
(C4), and (C5), called the feature-vector-cut, the
detachment-cut, and the multiplicity-cut, respectively.

Feature-vector-cut procedure

In the problem ETULF, we cannot use the bounding
operation proposed by Fujiwara et al. [19] directly due to
upper and lower feature vectors, but we can introduce a
bounding operation based on upper and lower feature
vectors by modifying Fujiwara et al.'s work slightly.

Let $T$ denote the current tree, $f_K(T)$ denote the feature
vector of $T$, $g_U$ denote the given upper feature vector, and
$g_L$ denote the given lower feature vector. By the feature
vector constraints in the problem ETULF, we check the
following condition.

\[ f_K(T) \le g_U. \quad (1) \]

If $T$ violates (1), then we discard $T$.

In addition, if $|T| = n$, then we check the following
condition based on the constraint of upper and lower
feature vectors.

\[ g_L \le f_K(T) \le g_U. \quad (2) \]

If $T$ violates (2), then we discard $T$.
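Conditions (1) and (2) translate into a short test. In this sketch the function name `feature_vector_cut` is hypothetical, and feature vectors are assumed to be dictionaries from label sequences to counts in which a missing key stands for the value 0.

```python
def feature_vector_cut(f, g_upper, g_lower, size, n):
    """Return True if the current tree must be discarded.

    f:       feature vector of the current tree
    g_upper: upper feature vector, g_lower: lower feature vector
    size:    number of vertices of the current tree
    n:       number of vertices of an output tree
    """
    # Condition (1): no entry may exceed the upper feature vector.
    if any(count > g_upper.get(t, 0) for t, count in f.items()):
        return True
    # Condition (2): the lower bound is only tested on complete trees.
    if size == n:
        if any(f.get(t, 0) < low for t, low in g_lower.items()):
            return True
    return False


g_upper = {("C",): 2, ("C", "C"): 2}
g_lower = {("C",): 2, ("C", "C"): 2}
partial = {("C",): 1}
complete = {("C",): 2, ("C", "C"): 2}
```

The lower bound cannot be tested on a partial tree because appending vertices can only increase path frequencies, so a partial tree below the lower feature vector may still have valid descendants.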

Detachment-cut procedure

This subsection describes the definition of detachment
[18] and a new bounding operation based on it for the
problem ETULF. Let $G$ be a multigraph that may have
self-loops, which represents the graph obtained from a
chemical graph $H$ by contracting the vertices with the
same label into a single vertex, where each vertex in $G$
corresponds to a label in $H$ (note that we do not eliminate
any edges of $H$ when contracting vertices to obtain $G$). A
process of regaining $H$ from $G$ is described as follows.
Given a function $r : V(G) \to \mathbb{Z}_+$, an $r$-detachment $H$ of $G$
is a multigraph obtained from $G$ by splitting each vertex
$v \in V(G)$ into a set of $r(v)$ copies of $v$, denoted by
$W_v = \{v^1, v^2, \ldots, v^{r(v)}\}$, so that each edge $\{u, v\} \in E(G)$
joins some vertices $u^i \in W_u$ and $v^j \in W_v$. Hence an
$r$-detachment $H$ of $G$ is not unique in general. A self-loop
$\{u, u\}$ in $G$ may be mapped to a self-loop $\{u^i, u^i\}$ or a
non-loop edge $\{u^i, u^j\}$ in a detachment $H$ of $G$. Note that,
for all vertex pairs $u, v \in V(G)$, the number of edges between
subsets $W_u$ and $W_v$ in $H$ is equal to that of edges between
vertices $u$ and $v$ in $G$.

To obtain a chemical graph $H$ as an $r$-detachment
of $G$, we need to specify the degree of vertices (with the
same label) in $H$. For a function $r : V(G) \to \mathbb{Z}_+$, an
$r$-degree specification is a set $\rho$ of vectors
$\rho(v) = (\rho_1^v, \rho_2^v, \ldots, \rho_{r(v)}^v)$ for $v \in V(G)$ such that

\[ \sum_{1 \le i \le r(v)} \rho_i^v = \deg(v; G), \]

which is necessary for all the edges incident to vertex
$v$ in $G$ to be assigned completely to split vertices
$v^i \in W_v$. An $r$-detachment $H$ of $G$ is called a
$\rho$-detachment if each $v \in V$ satisfies

\[ \deg(v^i; H) = \rho_i^v \quad \text{for all } v^i \in W_v = \{v^1, v^2, \ldots, v^{r(v)}\}, \]

which is a requirement that each vertex $v^i$ in $H$ must
have the prescribed degree $\rho_i^v$. Figure 3 illustrates a
$\rho$-detachment $H$ for a graph $G = (V, E)$ with $V = \{a, b, c\}$,
a function $r$ with $r(a) = 4$, $r(b) = 3$, $r(c) = 1$, and a degree
specification $\rho$ with $\rho(a) = (2, 2, 3, 2)$, $\rho(b) = (2, 3, 1)$,
$\rho(c) = (3)$. The next theorem gives a characterization of a
multigraph $G$ that admits a connected and loopless
$\rho$-detachment.

Theorem 2 (Nagamochi [18]) Let $G = (V, E)$ be a
multigraph, $r : V \to \mathbb{Z}_+$, and $\rho$ be an $r$-degree specification
with $\rho(v) \in \mathbb{Z}_+^{r(v)}$ ($v \in V$). Then
$G$ has a connected and loopless $\rho$-detachment $H$ if and
only if the following hold:

\[ r(X) + c(G - X) - d(X, V; G) \le 1 \quad (\forall X \subseteq V,\ X \ne \emptyset), \]

\[ 1 \le \rho_i^v \le d(v; G) + d(\{v\}, \{v\}; G) \quad (\forall v \in V,\ i = 1, 2, \ldots, r(v)), \]

where $r(X) = \sum_{v \in X} r(v)$, $c(G')$ denotes the number of
connected components of a graph $G'$, $G - X$ denotes the
graph obtained from a graph $G$ by removing the vertices
in $X$ together with all edges incident to vertices in $X$, and
$d(A, B; G)$ denotes the number of edges $(u, v) \in E$ with
$u \in A$ and $v \in B$.

Figure 3 A multigraph and a $\rho$-detachment. A multigraph
$G = (V, E)$ and a $\rho$-detachment $H = (\bigcup_{v \in V} W_v, E)$ of $G$.
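The first condition of Theorem 2 can be checked by brute force on small multigraphs, which helps to see what it expresses. This sketch is an illustration only: the names `components` and `connectivity_condition` are hypothetical, $d(X, V; G)$ is interpreted as the number of edges with at least one end vertex in $X$, the enumeration of all subsets is exponential in $|V|$, and the second (degree) condition of the theorem is not covered.

```python
from itertools import combinations


def components(vertices, edges):
    """Number of connected components, computed with union-find."""
    root = {v: v for v in vertices}

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]
            v = root[v]
        return v

    for u, v in edges:
        root[find(u)] = find(v)
    return len({find(v) for v in vertices})


def connectivity_condition(vertices, edges, r):
    """Check r(X) + c(G - X) - d(X, V; G) <= 1 for every nonempty X."""
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            X = set(subset)
            rest = [v for v in vertices if v not in X]
            kept = [(u, v) for u, v in edges if u not in X and v not in X]
            d = sum(1 for u, v in edges if u in X or v in X)
            if sum(r[v] for v in X) + components(rest, kept) - d > 1:
                return False
    return True


# Contraction of the path a - b - a: two parallel edges between a and b.
ok = connectivity_condition(["a", "b"], [("a", "b"), ("a", "b")],
                            {"a": 2, "b": 1})
# One edge cannot connect three split vertices.
bad = connectivity_condition(["a", "b"], [("a", "b")], {"a": 2, "b": 1})
```

In the second instance the set $X = V$ already violates the condition, reflecting that a connected graph on $r(V) = 3$ vertices needs at least two edges.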

Ishida et al. [22] proposed a bounding operation for
the problem ETPF based on Theorem 2. However, we
cannot use the bounding operation proposed by Ishida
et al. for the problem ETULF due to upper and lower
constraints. We now describe our new bounding operation
based on detachments for the problem ETULF. The
new bounding operation, called the detachment-cut, tests
whether the current multitree $T$ has a multitree that is
consistent with the given path frequencies among its
descendants in the family tree, based on the difference
between the feature vector $f_K(T)$ and the input feature
vectors $g_U$ and $g_L$.

Let $\ell_1, \ell_2, \ldots, \ell_s$ be input labels and
$g_U, g_L : \Sigma^{\le K+1} \to \mathbb{Z}_+$ be feature vectors. Let
$r_0, \ldots, r_h$ be the vertices in the rightmost path to which
a new leaf can be appended and $n_i^r$ ($1 \le i \le s$) denote the
number of vertices $r_j$ ($0 \le j \le h$) with $\ell(r_j) = \ell_i$. For
each label sequence $t$, $\#t$ denotes the number of paths $P$ in
$T$ with $\ell(P) = t$. From $g_U$, $g_L$, and $T$, we define new
feature vectors $g'_U$ and $g'_L$ of level $K = 1$ to be



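\[ g'_X(\ell_i) = \begin{cases} g_X(\ell_i) - \#\ell_i + n_i^r & (1 \le i \le s), \\ 1 & (i = s + 1), \end{cases} \]

\[ g'_X(\ell_i, \ell_j) = n_i^r \quad (1 \le i \le s,\ j = s + 1), \]

for $X \in \{U, L\}$, where the entries $g'_X(\ell_i, \ell_j)$ with
$1 \le i, j \le s$ are obtained from the difference between $g_X$ and
the feature vector $f_K(T)$.

We next introduce a vertex with a new label $\ell_{s+1}$ of
valence $h + 1$ (for example, label A in Fig. 4), a graph
$G_U = (V_U, E_U)$ with a vertex set
$V_U = \{v_1, \ldots, v_s, v_{s+1} \mid \ell(v_i) = \ell_i,\ 1 \le i \le s + 1\}$ and an edge set
$E_U = \{e_{ij} \mid e_{ij} = \{v_i, v_j\},\ d(\{v_i\}, \{v_j\}; G_U) = g'_U(\ell_i, \ell_j),\ 1 \le i, j \le s + 1\}$,
and a graph $G_L = (V_L, E_L)$ with a vertex set
$V_L = \{v_1, \ldots, v_s, v_{s+1} \mid \ell(v_i) = \ell_i,\ 1 \le i \le s + 1\}$ and an edge set
$E_L = \{e_{ij} \mid e_{ij} = \{v_i, v_j\},\ d(\{v_i\}, \{v_j\}; G_L) = g'_L(\ell_i, \ell_j),\ 1 \le i, j \le s + 1\}$.
Note that $d(\{v_i\}, \{v_j\}; G)$ means the multiplicity of the edge
$\{v_i, v_j\}$ in a graph $G$. The function $r$ and the degree
specification $\rho$ are defined to be

\[ r(v_i) = g'_U(\ell_i) \quad (1 \le i \le s + 1), \]

\[ \rho_j^v = \begin{cases} val(\ell(v^j)) & (v^j \notin \{r_0, \ldots, r_h\},\ 1 \le j \le r(v)), \\ val(\ell(v^j)) - \deg(v^j; T) + 1 & (v^j \in \{r_0, \ldots, r_h\},\ 1 \le j \le r(v)). \end{cases} \]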



Using $G_U$, $G_L$, $r$, and $\rho$, we can check whether the current
multitree $T$ violates (C4). We check that both of the
following conditions hold.

(a) $\deg(v; G_L) \le \sum_{1 \le i \le r(v)} \rho_i^v$ ($\forall v \in V_L$).

(b) $r(X) + c(G_U - X) - d(X, V_U; G_U) \le 1$ ($\forall X \subseteq V_U$,
$X \ne \emptyset$).

In the first condition, we check whether the number
of remaining bonds is large enough to satisfy the lower
feature vector constraint. In the second condition, we
check whether $T$ has a connected and loopless descendant
based on $G_U$ and Theorem 2.

Figure 4 Detachment-cut. Bounding operation by detachment-cut, where vectors $g_U(\ell, \ell')$, $g_L(\ell, \ell')$, $g'_U(\ell, \ell')$, and $g'_L(\ell, \ell')$ are defined for
unordered pairs $\{\ell, \ell'\}$ and those with value 0 are omitted in the tables. Values shown in the figure: $r(H) = 5$, $r(O) = 3$, $r(C) = 3$,
$\rho(H) = (1, 1, 1, 1, 1)$, $\rho(O) = (2, 2, 2)$, $\rho(C) = (4, 3, 3)$, $\rho(A) = (3)$.

Shimizu et al. BMC Bioinformatics 2011, 12(Suppl 14):S3
http://www.biomedcentral.com/1471-2105/12/S14/S3

Multiplicity-cut procedure

This subsection describes a new bounding operation
based on multiplicity for the problem ETULF. Let $g(\ell)$
be the number of vertices with label $\ell \in \Sigma$ that is
obtained from the given feature vectors. We assume
that $g(\ell)$ is fixed for all $\ell \in \Sigma$ in the problem ETULF.
Then we can calculate the number of edges in output
trees in the problem ETULF. Let $n$ be the number of
vertices in output trees. If we treat a multiple edge as a
set of single edges, the number of edges $e_m$ in an output
tree is given by

\[ e_m = \frac{1}{2} \sum_{\ell \in \Sigma} val(\ell)\, g(\ell). \]

On the other hand, if we treat a multiple edge as a
simple one, the number of edges $e_s$ in an output tree is
equal to $n - 1$ due to the tree-like constraint. Now we
consider

\[ M = e_m - e_s, \]

which means that only $M$ edges are used to construct
multiple bonds in an output tree. Note that $M \ge 0$. We
calculate $M$ from an input of the problem ETULF before
the enumeration algorithm starts.
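The computation of $M$ can be illustrated as follows. The function name `multiplicity_budget` and the dictionary encoding of `val` and `g` are assumptions of this sketch; the factor 1/2 reflects that each edge contributes to the degree of both of its end vertices.

```python
def multiplicity_budget(val, g):
    """Return M = e_m - e_s for valences `val` and vertex counts `g`."""
    degree_sum = sum(val[label] * g[label] for label in g)
    if degree_sum % 2 != 0:
        raise ValueError("the sum of valences must be even")
    e_m = degree_sum // 2   # edges counted with multiplicity
    n = sum(g.values())     # number of vertices of an output tree
    e_s = n - 1             # edges of a tree when multiplicity is ignored
    return e_m - e_s


val = {"C": 4, "O": 2, "H": 1}
ethylene = multiplicity_budget(val, {"C": 2, "H": 4})        # C2H4
methane = multiplicity_budget(val, {"C": 1, "H": 4})         # CH4
carbon_dioxide = multiplicity_budget(val, {"C": 1, "O": 2})  # CO2
```

Ethylene has one double bond, so one extra edge is available; methane has none; carbon dioxide has two double bonds and therefore two extra edges.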

Let $T = (V, E)$ be a multitree, and $m_e$ denote the
multiplicity of $e \in E$. The multiplicity $M(T)$ of $T$ is defined
to be

\[ M(T) = \sum_{e \in E} (m_e - 1). \]

Now we describe the multiplicity-cut based on $M(T)$
and $M$.

Let $T$ be the current rooted multitree in the branching
operation, $M(T)$ be the multiplicity of $T$,
$RP(T) = (r_0, r_1, \ldots, r_k)$ be the rightmost path of $T$, $T_i$ be the
new rooted multitree obtained by appending a new leaf $p$ to a
vertex $r_i$ ($0 \le i \le k$), and $RP(T_i)$ be the rightmost path of
$T_i$. When a new leaf $p$ is appended to $r_i$, the rightmost path
is updated by appending $p$ after $r_i$, that is,
$RP(T_i) = (r_0, r_1, \ldots, r_i, p)$. Then the multiplicities of the
edges $(r_j, r_{j-1})$, $j = k, k - 1, \ldots, i + 1$, are determined by
the valence constraint, and we update $M(T_i)$ at the same time.
We denote the multiplicity of an edge $(r_j, r_{j-1})$ in $T_i$ by
$Mul(r_j, r_{j-1} \mid T_i)$. When we update the multiplicity of the
edge $(r_j, r_{j-1})$, $M(T_i)$ is updated as follows:

\[ M(T_i) := \begin{cases} M(T) + Mul(r_k, r_{k-1} \mid T_i) - 1 & (j = k), \\ M(T_i) + Mul(r_j, r_{j-1} \mid T_i) - 1 & (i + 1 \le j \le k - 1). \end{cases} \]

By the definition of $M$, a valid multitree $T_i$ satisfies

\[ M(T_i) \le M. \quad (3) \]

If $T_i$ violates (3), then we discard $T_i$. See Fig. 5 for an
illustration of this.
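The test (3) itself is a simple comparison once the multiplicities of the already determined edges are known. In the sketch below the names `tree_multiplicity` and `multiplicity_cut` are hypothetical, and the determined multiplicities are assumed to be given as a dictionary from edges to positive integers; the incremental update along the rightmost path is not modeled.

```python
def tree_multiplicity(multiplicities):
    """M(T): sum of (m_e - 1) over the edges with known multiplicity."""
    return sum(m - 1 for m in multiplicities.values())


def multiplicity_cut(multiplicities, M):
    """Return True if the tree violates condition (3) and is discarded."""
    return tree_multiplicity(multiplicities) > M


# Edges of a partial tree with the multiplicities fixed so far.
fixed = {("c1", "c2"): 2, ("c2", "o1"): 2, ("c1", "h1"): 1}
```

With a budget of $M = 1$ the partial tree above would be discarded because two of its edges are already double bonds, while with $M = 2$ it would be kept.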

Results 

This section reports the experimental results of our
algorithm. First of all, we mention that the problem ETULF
can be solved by applying the algorithm proposed by
Ishida et al. [22] to each single feature vector in a given
set of feature vectors, i.e., the problem ETULF can be
regarded as a set of instances of the problem ETPF. We call
an algorithm for the problem ETULF based on the algorithm
proposed by Ishida et al. RepEnum (Repeated Enumeration). On
the other hand, we call our algorithm SimEnum (Simultaneous
Enumeration). It is to be noted that RepEnum is
one of the fastest tools to enumerate tree-like chemical
structures from a given molecular formula (i.e., feature
vector with K = 0) [22] and, to our knowledge, there does
not exist any other available tool to enumerate chemical
structures from a given feature vector based on path
frequency (i.e., feature vector with general K).

Now we compare the performances of the two algorithms
SimEnum and RepEnum, and we also compare the performances
of SimEnum with and without the multiplicity-cut.
We have tested the algorithm SimEnum for several widths
between upper and lower feature vectors. Tests were
carried out on a PC with an AMD Athlon Dual Core
Processor 5050e using instances based on some chemical
compounds selected from the KEGG LIGAND database
[25] (http://www.genome.jp/ligand/). Note that we treat a
benzene ring contained in these compounds as a new virtual
atom of valence six.

Table 1 Comparison of previous method and our method

             SimEnum                            RepEnum
K  w  f_v    time (s)  nodes        solutions  time (s)  nodes          solutions  solved

C00062 (C6H14N4O2), n = 26
1     3^6    1037.04   177,074,686  414,890    163.32    44,340,488     414,890    729
2     3^18   2.97      392,246      44         T.O.      1,381,?60,000  N.F.       6?,?09,572
3     3^34   1.22      145,213      2          T.O.      3,293,260,000  N.F.       96,860,588
4     3^53   0.33      34,539       1          T.O.      2,780,050,000  N.F.       81,766,176
5     3^7?   0.24      20,361       1          T.O.      1,???,930,000  N.F.       ?
6     3^85   0.25      15,166       1          T.O.      ???,??0,000    N.F.       1?,7?9,647
7     3^96   0.18      14,547       1          T.O.      79,870,000     N.F.       2,349,117

C03343 (C16H22O4), n = 37
1     3^6    T.O.      377,260,000  N.F.       T.O.      413,000,000    N.F.       460
2     3^18   7.24      845,760      25         T.O.      1,442,760,000  N.F.       70,175,902
3     3^31   2.81      307,151      7          T.O.      3,316,970,000  N.F.       195,115,882
4     3^47   1.03      99,945       1          T.O.      2,494,780,000  N.F.       146,751,764
5     3^64   0.98      87,600       1          T.O.      1,050,480,000  N.F.       61,792,941
6     3^82   0.76      60,194       1          T.O.      315,820,000    N.F.       18,577,647
7     3^99   0.57      42,538       1          T.O.      41,450,000     N.F.       2,438,235

C07178 (C21H28N2O5), n = 46
1     3^?    T.O.      157,320,000  N.F.       T.O.      200,490,000    N.F.       1,388
2     3^26   37.59     1,940,295    238        T.O.      2,911,390,000  N.F.       66,167,954
3     3^48   1.71      60,792       3          T.O.      2,673,940,000  N.F.       60,771,363
4     3^71   0.35      14,248       1          T.O.      1,925,490,000  N.F.       43,761,136
5     3^92   0.27      10,866       1          T.O.      743,940,000    N.F.       16,907,727
6     3^110  0.27      10,680       1          T.O.      93,880,000     N.F.       2,133,636
7     3^125  0.24      9,276        1          T.O.      19,270,000     N.F.       437,954

C03690 (C24H38O4), n = 61
1     3^5    T.O.      382,470,000  N.F.       T.O.      552,290,000    N.F.       61
2     3^16   T.O.      211,800,000  N.F.       T.O.      530,930,000    N.F.       10,451,912
3     3^27   1395.13   144,244,042  206        T.O.      3,314,260,000  N.F.       194,956,470
4     3^41   121.36    11,332,363   4          T.O.      2,392,530,000  N.F.       140,737,058
5     3^57   83.70     6,978,557    2          T.O.      958,650,000    N.F.       56,391,176
6     3^75   40.11     2,923,819    1          T.O.      298,600,000    N.F.       17,564,705
7            16.50     1,096,128    1                    38,670,000     N.F.

Comparison of SimEnum and RepEnum for the problem ETULF.
Note:

(1) C00062, C03343, C07178, and C03690 are the entries of the chemical compounds in the KEGG LIGAND database;

(2) n is the number of vertices in an instance preprocessed by replacing each benzene ring with a new atom of valence six;

(3) K is the level of given feature vectors;

(4) w is the width for constructing upper and lower feature vectors;

(5) f_v is the number of feature vectors in a given set;

(6) "time (s)" is the CPU time in seconds;

(7) T.O. means "time over" (the time limit is set to be 1,800 seconds);

(8) "nodes" is (the sum of) the number of nodes of family trees that are traversed;

(9) "solutions" is the number of all possible solutions;

(10) "solved" is the number of feature vectors which the algorithm RepEnum solved in the time limit; and

(11) N.F. means "not found."

We define $w \in \mathbb{Z}_+$ to be the width between upper and
lower feature vectors. From a feature vector $g$, we construct
two feature vectors $g_U$ and $g_L$ as follows. For each entry
$a > 0$ of $g$, the corresponding entry $a_U$ of the upper feature
vector $g_U$ is given by $a + w$, and the corresponding entry $a_L$
of the lower feature vector $g_L$ is given by $\max\{0, a - w\}$.
Note that if $w = 0$, then an instance of the problem ETULF is
equivalent to an instance of the problem ETPF.
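The construction of the test instances can be sketched as follows. The function names `bounds_from_width` and `count_feature_vectors` are hypothetical, and feature vectors are assumed to be dictionaries in which omitted entries are 0. The sketch applies the width to every positive entry it is given, so entries that must stay fixed (such as the numbers of atoms) would have to be handled separately by the caller; the count is a plain product over entries and ignores any dependencies between them.

```python
from math import prod


def bounds_from_width(g, w):
    """Build (g_upper, g_lower) from a feature vector g and a width w."""
    g_upper = {t: a + w for t, a in g.items() if a > 0}
    g_lower = {t: max(0, a - w) for t, a in g.items() if a > 0}
    return g_upper, g_lower


def count_feature_vectors(g_upper, g_lower):
    """Number of integer vectors between the lower and upper bounds."""
    return prod(g_upper[t] - g_lower[t] + 1 for t in g_upper)


g = {("C", "O"): 2, ("O", "C"): 2, ("C", "C"): 1}
upper, lower = bounds_from_width(g, 1)
```

With $w = 1$ and all entries at least 1, each entry can take three values, so the number of vectors in the set is a power of 3, and it grows quickly with the number of positive entries.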

Table 1 and Additional file 1 show the results of the
comparison. We find that the algorithm RepEnum cannot
solve all the problems with K = 2 within the time
limit since the number of feature vectors in a given set
increases exponentially with K. On the other hand,
Table 1 shows that the algorithm SimEnum can solve
the problem much faster for a larger K. This shows that
the algorithm SimEnum runs significantly faster than
the algorithm RepEnum. It is also seen that RepEnum
can only examine a very small portion of the feature vectors
in most cases. Additional file 1 shows that the algorithm
SimEnum including the multiplicity-cut runs faster than the
algorithm SimEnum not including the multiplicity-cut for
almost all of the instances. This shows that the
multiplicity-cut operation works well to improve enumeration
efficiency.

Table 2 shows the results on the performance for 
varying width w for the problem ETULF. The search 
space in the problem ETULF is exponentially increasing 
with w. However, it seems that the number of search 
nodes and computation time are not exponentially 
increasing with w. This suggests that the algorithm 
SimEnum works efficiently for the large search space in 
the problem ETULF. 

Here, we briefly discuss practical values on K and w 
though we do not have concrete evidence and these 
values depend on target classes of chemical compounds. 
It is suggested from the results on similar feature vectors 
[9,10,15] that K between 3 and 10 should be used. Though
there is no previous result on w, it is seen from Table 2 
that w cannot be large because there may exist too many 
solutions. Therefore, w less than 4 should be used. 

Conclusions 

We considered the problem of enumerating all tree-like 
chemical graphs from a given set of feature vectors, 
which is specified by upper and lower feature vectors 
based on frequencies of paths, and proposed a new 
exact branch-and-bound algorithm. Our experimental 
results show that our algorithm outperforms the naive
algorithm based on a previous method. In comparison
to the algorithm based on Ishida et al. [22], our
algorithm can greatly reduce the number of search nodes
and the computation time and enumerate all the feasible
solutions in many instances.

However, the search space of the problem ETULF is
much larger than that of the problem ETPF due to
upper and lower constraints, and in fact there are many
search nodes when solving the problem ETULF by our
algorithm. One direction for future work is to improve the
bounding operations or to introduce a new bounding
operation. In fact, in the feature-vector-cut described in
the Feature-vector-cut procedure subsection, information on
the lower feature vector $g_L$ is used only if $|T| = n$.
Another direction is to develop a web server that implements
our proposed algorithm. Generalization of the proposed
techniques to other types of kernel functions and other
problems is also left as future work.

Table 2 Comparison of varying width

         SimEnum
K  w     time (s)  nodes        solutions

C00062 (C6H14N4O2), n = 26
2  0     0.51      55,196       6
2  1     3.5?      400,501      44
2  2     7.58      835,509      503
2  3     10.84     1,163,548    2,351
2  4     12.55     1,349,057    5,430
2  5     13.29     1,431,075    9,852
2  50    14.31     1,537,496    25,425

C03343 (C16H22O4), n = 37
2  0     0.34      35,952       9
2  1     8.39      845,760      25
2  2     48.27     4,815,369    41
2  3     149.83    14,?81,738   305
2  4     377.01    37,43?,878   40,732
2  5     639.68    63,459,180   106,870
2  50    1118.75   110,703,034  510,079

C07178 (C21H28N2O5), n = 46
2  0     2.33      111,781      16
2  1     46.81     2,246,578    238
2  2     96.52     4,715,072    1,375
2  3     152.18    7,470,060    6,824
2  4     179.42    8,744,563    19,180
2  5     ?         ?            ?
2  50    255.01    12,292,587   54,861

C03690 (C24H38O4), n = 61
5  0     19.50     1,482,017    2
5  1     220.14    16,063,569   5
5  2     439.12    33,037,741   32
5  3     684.88    52,207,745   178
5  4     1024.96   78,509,554   349
5  5     1285.55   98,762,291   615
5  50    T.O.      136,835,134  N.F.

Comparison of the performance for varying w for the problem ETULF.

Additional material 



Additional file 1: Comparison of multiplicity-cut. Comparison of
SimEnum including multiplicity-cut and SimEnum not including
multiplicity-cut for the problem ETULF. Note: (1) "add multiplicity-cut" is
the algorithm SimEnum including multiplicity-cut; and (2) "no
multiplicity-cut" is the algorithm SimEnum not including multiplicity-cut.